Skip to content

repository: make check and delete work at N=1 with sha256 pack ids#9783

Open
mr-raj12 wants to merge 1 commit into
borgbackup:masterfrom
mr-raj12:repository-check-delete-sha256-pack-ids
Open

repository: make check and delete work at N=1 with sha256 pack ids#9783
mr-raj12 wants to merge 1 commit into
borgbackup:masterfrom
mr-raj12:repository-check-delete-sha256-pack-ids

Conversation

@mr-raj12

@mr-raj12 mr-raj12 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Description

borg check, delete and compact all assumed that a pack file's name is the chunk_id it holds. That is true only at N=1 while pack ids equal chunk ids. With BORG_TESTONLY_SHA256_PACK_ID=1 (and later at N>1) a pack is named sha256(pack_bytes), so the file name is the pack_id, not the chunk_id.

The object header already carries the chunk_id, meta size and data size (repoobj.py), so the chunk index can be rebuilt from the headers (real chunk_ids) instead of the file names, and removal can find the pack through the index.

Changes:

  • repoobj.py: add RepoObj.iter_object_headers(), which walks a pack's object headers and yields (chunk_id, obj_offset, obj_size). Works for one object per pack (N=1) and for several (N>1).
  • repository.py:
    • delete() becomes a logged no-op. A pack may hold several objects, so we can not remove one object by dropping its pack without losing the others; real removal goes through store_delete at the pack level.
    • list() iterates the chunk index and yields chunk_ids, instead of listing packs/ (which would yield pack_ids).
  • cache.py: build_chunkindex_from_repo() rebuilds the index from store_list + iter_object_headers (real chunk_id/offset/size) rather than calling Repository.list(), which would recurse into the very index it is building. Adds an init_flags argument so compact can start the chunks as unused and compute usage itself.
  • compact_cmd.py: --stats builds the index from a real header scan instead of a pack_id == chunk_id fake; the delete loop drops each unused chunk's pack via store_delete.
  • archive.py / debug_cmd.py: check --repair and debug delete-obj resolve the pack via the chunk index and then store_delete it; debug delete-obj maps a missing pack (stale index entry) back to "not found".
  • tests: a delete_chunk helper drops a chunk's pack at the store level (used by the check / extract / mount damage tests), since Repository.delete no longer removes anything; test_empty_repository drops packs through the store; test objects get a unique chunk_id so identical payloads do not collapse into one pack under sha256.

Verified with the local test subset, with and without BORG_TESTONLY_SHA256_PACK_ID=1: the repository- and archive-level missing-chunk checks pass under the switch and nothing regresses without it. Archive --verify-data corruption detection under sha256 pack ids (a few check --verify-data tests) is left for a separate change.

refs #8572

Checklist

  • PR is against master
  • New code has tests and docs where appropriate
  • Tests pass (ran the relevant test subset)
  • Commit messages are clean and reference related issues

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 84.21053% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.84%. Comparing base (2eef6c8) to head (4121509).
⚠️ Report is 14 commits behind head on master.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/borg/cache.py 69.23% 3 Missing and 1 partial ⚠️
src/borg/repository.py 83.33% 1 Missing and 2 partials ⚠️
src/borg/archive.py 50.00% 1 Missing ⚠️
src/borg/archiver/debug_cmd.py 88.88% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9783      +/-   ##
==========================================
- Coverage   84.88%   84.84%   -0.05%     
==========================================
  Files          92       92              
  Lines       15172    15166       -6     
  Branches     2275     2281       +6     
==========================================
- Hits        12879    12867      -12     
- Misses       1589     1594       +5     
- Partials      704      705       +1     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

Comment thread src/borg/testsuite/archiver/check_cmd_test.py Outdated
Comment thread src/borg/testsuite/repository_test.py Outdated
Comment thread src/borg/repoobj.py Outdated
Comment thread src/borg/repository.py Outdated
@mr-raj12 mr-raj12 force-pushed the repository-check-delete-sha256-pack-ids branch from 28c817b to f2ca1ff Compare June 16, 2026 20:34
A pack is named sha256(pack_bytes), not chunk_id. delete becomes a no-op
(real removal is store_delete at the pack level); list iterates the chunk
index; build_chunkindex_from_repo reads pack headers for real chunk_ids and
gains init_flags so compact computes usage; tests drop a chunk's pack.
@mr-raj12 mr-raj12 force-pushed the repository-check-delete-sha256-pack-ids branch from f2ca1ff to 4121509 Compare June 16, 2026 20:39

@ThomasWaldmann ThomasWaldmann left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stuff I found.

chunk index, since the pack file name is the pack_id, which need not equal the chunk_id.
"""
entry = repository.chunks.get(id)
repository.store_delete("packs/" + bin_to_hex(entry.pack_id))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if entry is not None: ... else: ...

Comment on lines +87 to +89
# delete is a no-op now (see Repository.delete), so the object stays retrievable.
repository.delete(key50)
with pytest.raises(Repository.ObjectNotFound):
repository.get(key50)
assert pdchunk(repository.get(key50)) == b"SOMEDATA"

@ThomasWaldmann ThomasWaldmann Jun 16, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rather don't test Repository.delete at all than testing some behaviour that doesn't make sense.

Comment thread src/borg/archive.py
Comment on lines +1865 to +1868
# N=1: the defect chunk is alone in its pack; drop the pack. N>1 needs compaction.
pack_id = self.chunks[defect_chunk].pack_id
del self.chunks[defect_chunk]
self.repository.delete(defect_chunk)
self.repository.store_delete("packs/" + bin_to_hex(pack_id))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what part of "don't do actions that are very harmful for the general case" didn't you understand?

you are killing a complete pack here because one chunk was problematic.

Comment thread src/borg/cache.py
Comment on lines +658 to +665
for info in repository.store_list("packs"):
pack_id = hex_to_bin(info.name)
pack = repository.store_load("packs/" + info.name)
for chunk_id, obj_offset, obj_size in RepoObj.iter_object_headers(pack):
num_chunks += 1
chunks[chunk_id] = ChunkIndexEntry(
flags=init_flags, size=0, pack_id=pack_id, obj_offset=obj_offset, obj_size=obj_size
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will read all the data that is stored in all packs.

There needs to be at least a TODO to improve efficiency by only loading the RepoObj headers from the packfile and not also all metadata and data.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also bit tricky considering how the borgstore internal caching works right now (IF the Store is configured with a cache): on first access, it will load the complete pack. So even if borg would only read little from the pack, it would fetch the whole pack anyway.

Comment thread src/borg/cache.py
flags=init_flags, size=0, pack_id=pack_id, obj_offset=obj_offset, obj_size=obj_size
)
else:
# Legacy repo: list() reads its own segment index (no recursion). get() routes through that

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That "segment index" used to be called "repository index" in borg1.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that legacy part even used? By what?

Comment thread src/borg/repoobj.py
return data[hdr_size + hdr.meta_size :] # crypted data

@classmethod
def iter_object_headers(cls, pack: bytes):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Problematic: API requires giving the complete pack contents.

Could do partial loads to only load the interesting parts of the pack and skip 99% of the data it does not need.

Comment thread src/borg/repository.py
if key <= last_key_checked: # needs sorted keys
continue
try:
obj = self.store.load(key)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keeping the old name here is problematic. obj would make readers think this is a repoobj, while in reality it is a complete pack.

Comment thread src/borg/repository.py
info = dict(id=self.id, version=self.version)
return info

def check(self, repair=False, max_duration=0):

@ThomasWaldmann ThomasWaldmann Jun 16, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess Repository.check rather needs a rewrite than the little changes you do here.

context:

borg1 check --repository-only for a remote (ssh:) repository ran the Repository code ON THE REPO SERVER. Thus, reading all data was just local file I/O on the server and acceptable.

borg2 does most things client-side by default and the remoting happens inside borgstore - the Repository class code is NOT running on the repo server, but always on the client. Thus, reading all data from all packs is not acceptable because it would transfer everything over the network. Some borgstore code IS running on the repo server though (at least for "rest:" repos), e.g. Store.hash can be done repo-server-side.

For borg2 check --repository-only we could do it like this:

  • iterate over all pack ids
  • use store.hash(packname) to compute the hash from the data ON THE REPO SERVER
  • compare computed hash with pack id
  • if it matches, the pack does not have corruption.
  • if it does not match, the pack is corrupted. without --repair just report that. with --repair we need to fetch it, iterate over all repoobjs and parse them into meta and data (this decrypts and find corrupted repoobjs because the AEAD authentication+decryption fails in the authentication step). drop corrupted repoobjs, write good repoobjs into new pack, update the index.

The index could be verified in a similar way, just comparing hashes. If there is a mismatch, emit a warning so that the user will rebuild the index.

Guess borg check without --repair would not rebuild an index by default (too expensive/slow), but just assume that the index is ok IF the hashes are ok.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does not necessarily need to be in this PR, you decide.

Comment thread src/borg/repository.py
# <marker> ourselves; index order is stable unless the index is mutated, which is all the
# marker pagination needs.
self._lock_refresh()
collect = marker is None

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is a better expression than the original code.

@ThomasWaldmann

ThomasWaldmann commented Jun 16, 2026

Copy link
Copy Markdown
Member

Good, failures down to 6 in sha256-packid mode. \o/

borg check --archives-only --verify-data seems to be broken, it exits with 0 although a file chunk is corrupted.

Maybe check if that is a indexing problem (index not updated after put(chunkid, corrupted_data)) or if there is something wrong in the check code still.

Hmm, maybe this is the wrong way to test that. The idea of that test was to simulate a corrupted file chunk (like bit rot), but that put() also writes it to a new packid and also it does a put for a chunkid we already have in the repo (that is an action that borg usually never does).

Guess it would be better and also more realistic to low-level corrupt the chunk data inside the pack without using put() and without touching the index.

@ThomasWaldmann

Copy link
Copy Markdown
Member

BTW, if you address review feedback, can you rather do new commits than amending please - that is easier for review.

At the end, git rebase -i can be used to squash stuff together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants